Abstract

We introduce an extension of the XQuery lan- 
guage, FluX, that supports event-based query 
processing and the conscious handling of main 
memory buffers. Purely event-based queries 
of this language can be executed on stream- 
ing XML data in a very direct way. We then 
develop an algorithm that allows us to efficiently
rewrite XQueries into the event-based FluX 
language. This algorithm uses order con- 
straints from a DTD to schedule event han- 
dlers and to thus minimize the amount of 
buffering required for evaluating a query. We 
discuss the various technical aspects of query 
optimization and query evaluation within our 
framework. This is complemented with an ex- 
perimental evaluation of our approach. 

1 Introduction 

XML is the preeminent data exchange format on the 
Internet. Stream processing naturally bears relevance 
in the data exchange context (e.g., in e-commerce). 
An increasingly important data management scenario 
is the processing of XQueries on streams of exchanged 
XML data. While the weaknesses of XML as a 
semistructured data model have been observed time
and again (cf. e.g. ^), XQuery on XML streams can 
be seen as the prototypical instance of the problem of 
queries on structured (vs. flat tuple) data streams. 

Query engines for processing streams are natu- 
rally main-memory-based. Conversely, in some ef- 
forts towards developing main-memory XQuery en- 
gines whose original emphasis was not on stream pro- 
cessing (e.g., BEA's XQRL 0), it was observed that 
it is worthwhile to build such systems using stream 
processing operators. 

The often excessive need for buffers in current
main memory query engines causes a scalability is- 
sue that has been identified as a significant research 
challenge [H [HI [H [3 E] . While the efficient eval- 
uation of XPath queries on streams has been worked 
on extensively in the past (here, state-of-the-art tech- 
niques use very little main memory) |21 IIUI lll| . 
not much work has been done on efficiently process- 
ing XQuery on streams. The nature of XQuery, as a 
data-transformation query language entirely different 
from node-selecting XPath, requires new techniques
for dealing with (and reducing) main memory buffers. 
State-of-the-art XQuery engines consume main mem- 
ory in large multiples of the actual size of input XML 
documents ^^. 

Several recent projects have addressed XQuery on 
streams using transducer networks [131 I16| . Auto- 
mata-based techniques are usually quite elegant but 
are hard to compare or integrate with other approaches 
and usually do not generalize to real-world query lan- 
guages such as (full) XQuery with their great expres- 
sive power and all their odd features and artifacts of 
the standardization process. One approach [14] towards
addressing the problem of reducing main mem-
ory consumption in an engine for full XQuery aims at 
reducing the amount of data buffered in main memory 
by pre-filtering the data read from the stream with 
the paths occurring in the query. However, for real- 
world XQueries, the need for substantial main memory 
buffers cannot be avoided in general. 

An important goal is thus to devise a well-principled 
machinery for processing XQuery that is parsimonious 
with resources and allows us to minimize the amount of
buffering. Such machinery needs to be based on in- 
termediate representations of queries that are syntac- 
tically close to XQuery and has to allow for an alge- 
braic approach to query optimization, with buffering 
as an optimization target. This is necessary to allow 
for both extensibility and the leverage of a large body 
of related earlier work done by the database research 
community. However, to our knowledge, no principled 
work exists on query optimization in the framework 
of XQuery (rather than automata) for structured data 
streams (such as XML, but unlike flat tuple streams) 
which honors the special features of stream process- 
ing. Moreover, no framework for optimizing queries 
on structured data streams exists that captures the 
spirit of stream processing and allows for query opti- 
mization using schema information. (However, there 
are XQuery algebras meant for conventional query pro- 
cessing ^1|S], and there is work on applying them in 
the streaming context jZj. Moreover, the problem of 
optimizing XQueries using a set of constraints holding 
in the XML data model - rather than a schema - was 
addressed in [H].) 

In this paper, we attempt to improve on this situ- 
ation. We introduce a query language, FluX, which 
extends XQuery by a new construct for event-based 
query processing called process-stream. FluX moti- 
vates a very direct mode of query evaluation on data 
streams (similar to query evaluation in XQRL j^), 
and provides a strong intuition for what main mem- 
ory buffers are needed in which queries. This allows 
for a strongly "buffer-conscious" mode of query opti- 
mization. The main focus of this paper is on automati- 
cally rewriting XQueries into event-based FluX queries 
and at the same time optimizing (reducing) the use of 
buffers using schema information from a DTD. 

Consider the following XQuery Q in a bibliography
domain, taken from the XML Query Use Cases [19]
(XMP Q3):

<results>
{ for $b in $ROOT/bib/book return
  <result> { $b/title } { $b/author } </result> }
</results>

For each book in the bibliography, this query lists its 
title(s) and authors, grouped inside a "result" element. 
Note that the XQuery language requires that, within 
each book, all titles are output before all authors. 
The DTD 

<!ELEMENT bib (book)*>
<!ELEMENT book (title | author)*>

specifies that each book node may have several title 
and several author children. A priori, no order among 
these items is inferable from the given DTD. To imple- 
ment this query, we may output the title children in- 
side a book node as soon as they arrive on the stream. 
However, the output of the author children needs to 
be delayed (using a memory buffer) until we reach the 
closing tag of the book node (at that time, no further 
title nodes may be encountered). Then we may flush 
the buffer of author nodes, empty it, and later refill it 
with the author nodes from the next book. 
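
The buffering strategy just described can be illustrated with a minimal Python sketch. It is not part of the FluX system; the function name `stream_books` and the representation of each book as a list of `(label, text)` children in document order are assumptions made purely for illustration.

```python
def stream_books(books):
    """Evaluate query Q over books whose children arrive in document order.

    Titles are written to the output as soon as they are seen; authors are
    held in a per-book buffer that is flushed when the book ends.
    """
    out = ["<results>"]
    for children in books:                  # one book node at a time
        out.append("<result>")
        author_buffer = []                  # authors of the current book only
        for label, text in children:
            if label == "title":
                out.append(f"<title>{text}</title>")               # emit now
            elif label == "author":
                author_buffer.append(f"<author>{text}</author>")   # delay
        out.extend(author_buffer)           # closing tag of book: flush buffer
        out.append("</result>")
    out.append("</results>")
    return out


example = stream_books([[("author", "A"), ("title", "T1"), ("author", "B")]])
```

At any moment the sketch holds only the authors of the current book, which is the behaviour the text argues for.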

We thus only need to buffer the author children of 
one book node at a time, but not the titles. Current 
main memory query engines do not exploit this fact
and rather buffer either the entire book nodes or, as
an optimization [14], all title and all author nodes
of book. Previous frameworks for evaluating or opti- 
mizing XQuery do not provide any means of making 
this seeming subtlety explicit and reasoning about it. 
The process-stream construct of FluX allows us to
express precisely the mode of query execution just de- 
scribed. XQuery Q is then phrased as a FluX query 
as follows: 

<results>
{ process-stream $ROOT: on bib as $bib return
  { process-stream $bib: on book as $book return
    <result>
    { process-stream $book:
        on title as $t return {$t};
        on-first past(title, author) return
          { for $a in $book/author return {$a} } }
    </result> } }
</results>

A process-stream $x expression consists of a number
of handlers which process the children of the XML
tree node bound by variable $x from left to right. An
"on a" handler fires on each child labeled "a" visited
during such a traversal, executing the associated
query expression. In the process-stream $book
expression above, the on-first past(title, author)
handler fires exactly once as soon as the DTD im- 
plies for the first time that no further author or 
title node can be encountered among the children 
of $book. (As observed above, in the given, very 
weak DTD, this is the case only as soon as the last 
child of $book has been seen.) In the query associated 
with the on-first past (title, author) handler, we 
may freely use paths of the form $book/author or 
$book/title, because such paths cannot be encoun- 
tered anymore and we may assume that the query en- 
gine has already buffered all matches of these paths for 
us. It is a feasible task for the query engine to buffer 
only those paths that the query actually employs (see 
also [n]). 

We call a query safe for a given DTD if, informally, 
it is guaranteed that XQuery subexpressions (such as 
the for-loop in the query above) do not refer to paths 
that may still be encountered in the stream. The above 
FluX query is safe: The for-expression employs the
$book/author path, but is part of an on-first handler 
that cannot fire before all author nodes relative to 
$book have been seen. 

If the path $book/author was replaced by, say, 
$book/price and the DTD production for book were 

<!ELEMENT book ((title | author)*, price)>

then the FluX query above would not be safe. In that
case, on the firing of on-first past(title, author),
the buffer for $book/price items would still be empty 
and the query result would be incorrect. 

Query Q can be processed more efficiently with the 
schema used in the XML Query Use Cases, 



<!ELEMENT bib (book)*>
<!ELEMENT book (title, (author+ | editor+),
                publisher, price)>

Here, no buffering is required to execute our query
because the DTD asserts that for each book, the title
occurs strictly before the authors (we denote this
as Ord_book(title, author), called an order constraint).
We may phrase our query in FluX so as to directly copy 
titles and authors to the output as they arrive on the 
input stream. No data items need to be buffered. 

<results>
{ process-stream $ROOT: on bib as $bib return
  { process-stream $bib: on book as $book return
    <result>
    { process-stream $book:
        on title as $t return {$t};
        on author as $a return {$a} }
    </result> } }
</results>

The contributions of this paper are as follows. 

• We introduce the FluX query language, which ex- 
tends XQuery by the natural stream processing con- 
struct discussed above. 

• We define the safe FluX queries (under a given 
DTD), which are those FluX queries in which 
XQuery subexpressions have the usual semantics 
(i.e., are never executed before the data items re- 
ferred to have been fully read from the stream and 
may be assumed available in main memory buffers). 

• We present an algorithm that schedules XQueries 
on streams using DTDs and transforms them into 
optimized FluX queries. 

• We discuss the realization of query engines for FluX 
and the runtime buffer management. 

• We have built a prototype FluX query engine which 
we evaluate by means of a number of experiments. 

This is, to our knowledge, the first work on opti- 
mizing XQuery using schema constraints derived from 
DTDs.¹ A main strength of the approach taken in this
paper is its extensibility, and even though space limita- 
tions require us to restrict our discussion to a (power- 
ful) fragment of XQuery, our results can be generalized 
to even larger fragments. In our discussion at the end 
of the paper, we will also lay the foundations for alge- 
braic optimization of queries using further information 
from the schema. 

This paper is structured as follows. We start with 
basics on DTDs and regular languages in Section 2. 
Section 3 defines the query languages considered in
this paper: Section 3.1 specifies an XQuery fragment.
Based on this, Section 3.2 defines the FluX language,



¹To simplify presentation we restrict ourselves to DTDs,
but the required information could also be derived from XML 
and Section 3.3 singles out the safe FluX queries.
Section 4 presents our algorithms for translating XQuery
into a particular normal form (Section 4.1) and for
transforming this normal form into FluX (Section 4.2).
Some examples of this transformation are given in
Section 4.3. In Section 5 we discuss the implementation
of our prototype system and the actual handling
of buffers during query evaluation. In Section 6 we
present our experiments, and we conclude with a
discussion in Section 7.

2 Preliminaries 

For simplicity of exposition, we consider the fragment 
of XML without attributes as our data model. Note 
that this is no substantial restriction, since attributes 
can be handled in the same way as subelements. 

We focus on valid documents, i.e. documents con- 
forming to a given document type definition (DTD). 

Let Σ be a set of symbols (or tag names). A DTD
is an extended context-free grammar over Σ. DTDs
are local tree grammars ^^Ij, i.e., without competing
nonterminals on the left-hand sides of productions, so
each production in a DTD is unambiguously identified
by a tag name in Σ.

Let ρ be a regular expression and let symb(ρ) be
the set of atomic symbols that occur in ρ. By L(ρ) we
denote the language defined by ρ, i.e., the set of words
over symb(ρ) that are recognized by ρ. Given a word
w, let w_i denote its i-th symbol. We define a binary
relation Ord_ρ ⊆ Σ × Σ such that for a, b ∈ Σ,

\[ \mathit{Ord}_\rho(a, b) \;:\Leftrightarrow\; \neg\exists\, w \in L(\rho) :\; w_i = b \,\wedge\, w_j = a \,\wedge\, i < j. \]

That is, Ord_ρ(a, b) holds if there is no word in L(ρ) in
which a symbol a is preceded by a symbol b. (All a
symbols occur before all b symbols.) We refer to a
constraint of the form Ord_ρ(a, b) as an order constraint.

Example 2.1 Let ρ = (a*.b.c*.(d|e*).a*). Then,
Ord_ρ(b, c), Ord_ρ(c, d), and Ord_ρ(c, e), but ¬Ord_ρ(a, c).
Ord_ρ is transitive, so we also have, e.g., Ord_ρ(b, d). □
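
The constraints of Example 2.1 can be checked mechanically. The following Python sketch is only a brute-force illustration: it enumerates all words up to a fixed length (an assumption; the bound `max_len` and the helper names `words` and `ord_holds` are hypothetical) and therefore approximates Ord_ρ rather than implementing the polynomial-time procedure referred to in the text.

```python
import re
from itertools import product


def words(pattern, alphabet, max_len):
    """Yield every word over `alphabet` of length <= max_len in L(pattern)."""
    rx = re.compile(pattern)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if rx.fullmatch(w):
                yield w


def ord_holds(pattern, alphabet, a, b, max_len=6):
    """Bounded check of Ord(a, b): no enumerated word has a `b` before an `a`."""
    for w in words(pattern, alphabet, max_len):
        # Some b precedes some a iff the first b lies before the last a.
        if a in w and b in w and w.index(b) < w.rindex(a):
            return False
    return True


P = "a*bc*(d|e*)a*"     # the expression of Example 2.1, with "." omitted
SIGMA = "abcde"
```

For instance, `ord_holds(P, SIGMA, "a", "c")` is refuted by the word "bca", in which a c precedes an a.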

DTDs have the nice property that regular expres- 
sions appearing in the right-hand sides of productions 
are one-unambiguous. This guarantees that an equivalent
deterministic finite automaton can be computed
in polynomial - even quadratic - time [3] . 

One can show the following: 

Proposition 2.2 Given a regular expression p from a 
DTD, Ordp can be computed in time 0(|p|^). 

Let ρ be a regular expression and let S ⊆ Σ. Then,
for each word u = u_1 ... u_n ∈ symb(ρ)*,

\[
\begin{aligned}
\mathit{Past}_{\rho,S}(u) \;&:\Leftrightarrow\; \forall w \in \mathit{symb}(\rho)^* :\; uw \in L(\rho) \rightarrow \neg\exists\, i : w_i \in S, \\
\textit{first-past}_{\rho,S}(u) \;&:\Leftrightarrow\; \mathit{Past}_{\rho,S}(u) \,\wedge\, \bigl(n > 0 \rightarrow \neg\mathit{Past}_{\rho,S}(u_1 \ldots u_{n-1})\bigr).
\end{aligned}
\]

Intuitively, when processing a word uw ∈ L(ρ) from
left to right, if first-past_{ρ,S}(u) holds, then the reading
of the last symbol of u is the earliest possible time at
which we know that none of the symbols in S can be
seen anymore until the end of the word uw.
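
To make the two predicates concrete, here is a small Python sketch that evaluates them by bounded enumeration. It is an illustration under stated assumptions, not the efficient check mentioned in the text: the helper names are hypothetical, words are only enumerated up to `max_len`, and the book production title, (author+ | editor+), publisher, price is encoded with the single letters t, a, e, p, r.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All words over `alphabet` of length <= max_len matched by `pattern`."""
    rx = re.compile(pattern)
    return [
        "".join(t)
        for n in range(max_len + 1)
        for t in product(alphabet, repeat=n)
        if rx.fullmatch("".join(t))
    ]


def past(lang, S, u):
    """Past(u): no enumerated continuation of u contains a symbol of S."""
    return all(not (set(w[len(u):]) & set(S)) for w in lang if w.startswith(u))


def first_past(lang, S, u):
    """first-past(u): Past holds at u but did not yet hold one symbol earlier."""
    return past(lang, S, u) and (len(u) == 0 or not past(lang, S, u[:-1]))


# t = title, a = author, e = editor, p = publisher, r = price
BOOK = language("t(a+|e+)pr", "taepr", 6)
```

With S = {t, a}, the predicate first becomes true after reading the publisher in "tap", or already after the first editor in "te"; because of the length bound, results for prefixes close to `max_len` symbols should not be trusted.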

The appendix shows how order constraints as well as
Past and first-past events can be efficiently checked for 
a given DTD. 

3 Query Language 

In this section, we define the syntax and semantics of
the FluX query language, which extends an XQuery
fragment, denoted as XQuery⁻, by a construct for
event-based query processing.

Before defining FluX and XQuery⁻, we need some
more notation. We write $x, $y, $z, ... to denote
variables that range over XML trees. In the following,
we overload the meaning of a variable $x bound to an
XML tree whose root is labeled a, by writing $x when
we actually mean the DTD production unambiguously
identified by the element a. For example, if the DTD
contains the rule <!ELEMENT a ρ_a> for a regular
expression ρ_a, we write Ord_$x(c, d) instead of Ord_ρa(c, d),
and we write symb($x) instead of symb(ρ_a).

A fixed path is a sequence a_1/.../a_n, where the
a_i are symbols from the DTD and n ≥ 1. XPath
expressions such as a/*/b or a//b or a[b] are excluded.

An atomic condition is either of the form
$x/π RelOp s, exists $x/π, or $x/π RelOp $y/π′,
where s is a string, π and π′ are fixed paths, and
RelOp ∈ {=, <, ≤, >, ≥}. A condition is a Boolean
combination (using "and", "or", "not", and "true")
of atomic conditions.

3.1 An XQuery Fragment: XQuery⁻

Definition 3.1 (XQuery⁻) The XQuery fragment
XQuery⁻ is the smallest set consisting of expressions

1. ε (the empty query)

2. s (output of a fixed string)

3. α β (sequence)

4. { for $x in $y/π return α } (for-loop)

5. { for $x in $y/π where χ return α } (conditional
for-loop)

6. { $x/π } (output of subtrees reachable from node
$x through path π)

7. { $x } (output of subtree of node $x)

8. { if χ then α } (conditional)

where π is a fixed path, s a fixed string, χ a condition,
and α and β are XQuery⁻ expressions.

Indeed, XQuery⁻ is very similar to (a fragment of)
standard XQuery ^Hj but differs in how we treat fixed
strings inside queries. For example, the string <hello>
is valid in XQuery⁻, but not in standard XQuery. The
query

<result> { $ROOT/bib/book } </result>

is understood in standard XQuery as a "result" node
with an embedded query to produce its children. In
the present paper, the same query is read as a sequence
of three queries which write the string <result>, the
/bib/book subtrees, and finally the string </result>
to the output.

This, however, is only a subtlety which, on the one
hand, is very convenient for obtaining our main results
in Section 4 and which, on the other hand, as
the following Proposition 3.2 shows, does not cause
any problems. The alternative semantics of XQuery⁻
is the basis of optimizations used internally by the
query engine. Users formulate input queries in standard
XQuery and may assume the usual semantics.

Let ⟦Q⟧_XQuery⁻(D) (resp., ⟦Q⟧_XQuery(D)) denote
the XML document stream produced by evaluating
query Q on document D under our XQuery⁻ semantics
(resp., under the standard XQuery semantics [IHI).

Proposition 3.2 Let Q be an XQuery that parses as
an XQuery⁻ query. Then, for any input document D,

\[ [\![Q]\!]_{\mathrm{XQuery}^-}(D) = [\![Q]\!]_{\mathrm{XQuery}}(D). \]

3.2 Syntax and Semantics of FluX 

A simple expression is an XQuery⁻ expression of the
form α β γ where

• α and γ are possibly empty sequences of strings and
of expressions of the form "{if χ then s}", where
χ is a condition and s is a string.

• β is either empty, "{$u}", or "{if χ then {$u}}",
for some variable $u and some condition χ.

• if β is of the form "{$u}", or "{if χ then {$u}}",
then no atomic condition that occurs in α β contains
the variable $u.

For instance,

<a>{$x}</a> {if $x/b=5 then <b>5</b>}

is a simple expression, but {$x}{$y} is not.

Definition 3.3 (FluX) The class of FluX expressions
is the smallest set of expressions that are either
simple or of the form

s { process-stream $y: ζ } s′

where s and s′ are possibly empty strings, $y is a variable,
and ζ is a list (where entries are separated by
semicolons ";") of one or more event handlers. Each
event handler is of one of the following two types:

1. (so-called "on-first" handler)

on-first past(S) return α

where S ⊆ symb($y) and α is an XQuery⁻ expression

2. (so-called "on" handler)

on a as $x return Q

where $x is a variable, a is an element name in
symb($y), and Q is a FluX expression.

We will use ps as a shortcut for process-stream,
on-first past(*) as an abbreviation for on-first
past(symb($y)), and furthermore on-first past()
in place of on-first past(∅).

Some examples of FluX expressions, as well as an
informal description of the FluX semantics, were already
given in Section 1; further examples can be found in
Section 4.3. In general, we evaluate an expression

{ process-stream $y: ζ }

as follows: An event-handling statement considers the
children of the node currently bound by variable $y as
a list (or stream) of nodes and processes this list one
node at a time. On processing a node v with children
t_1, ..., t_n, with the labels of t_i denoted as label(t_i),
we proceed as follows. For each i from 0 to n+1
(i.e., n+2 times), we scan the list of event handlers
ζ = ζ_1; ...; ζ_m once from the beginning to the end. In
doing so, we test for each event handler ζ_j whether
its event condition is satisfied, in which case the event
handler ζ_j "fires" and the corresponding query expression
is executed:

• A handler "on a as $x return Q" fires if 1 ≤ i ≤
n and label(t_i) = a.

• A handler "on-first past(S) return α" fires if
0 ≤ i ≤ n and first-past_{$y,S}(label(t_1) ... label(t_i))
is true (i.e., for the first time while processing the
children of $y, no symbol of S can be encountered
anymore) or if i = n+1 and this event handler has
not fired in any of the previous (n+1) scans.

In summary, it is well possible that several events fire
for a single node, in which case they are processed in
the order in which the handlers occur in ζ. During
the run on t_1, ..., t_n, each "on" handler may fire zero
up to several times, while each "on-first" handler is
executed exactly once.
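
The scanning discipline described above can be simulated in a few lines of Python. This is a hedged sketch rather than the FluX engine: handlers are modelled as tuples, the first-past test is supplied as an oracle function, and "on-first" callbacks simply receive the children seen so far as a stand-in for buffers. All names (`process_stream`, `weak_dtd_first_past`) are hypothetical.

```python
def process_stream(children, handlers, first_past):
    """Simulate { process-stream $y: handlers } over the children of $y.

    children:   list of (label, node) pairs in document order
    handlers:   list of ("on", a, fn) or ("on-first", S, fn) tuples
    first_past: oracle first_past(S, labels_seen) -> bool
    """
    n = len(children)
    fired = set()                       # indices of on-first handlers that fired
    out = []
    for i in range(0, n + 2):           # i = 0, ..., n+1
        for k, (kind, arg, fn) in enumerate(handlers):
            if kind == "on":
                if 1 <= i <= n and children[i - 1][0] == arg:
                    out.append(fn(children[i - 1][1]))
            elif k not in fired:
                seen = children[:min(i, n)]
                labels = [label for label, _ in seen]
                # Fire on first-past, or at the final scan if not fired yet.
                if (i <= n and first_past(arg, labels)) or i == n + 1:
                    fired.add(k)
                    out.append(fn(seen))
    return out


def weak_dtd_first_past(S, labels):
    """Oracle for book -> (title | author)*: only the empty set is ever past."""
    return len(S) == 0 and len(labels) == 0


handlers = [
    ("on", "title", lambda t: f"<title>{t}</title>"),
    ("on-first", {"title", "author"},
     lambda seen: "".join(f"<author>{x}</author>" for lab, x in seen if lab == "author")),
]
result = process_stream(
    [("author", "A"), ("title", "T"), ("author", "B")],
    handlers,
    weak_dtd_first_past,
)
```

Under the weak DTD the "on-first" handler only fires in the final scan (i = n+1), so the authors are emitted after the title even though one of them arrived first.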

For a FluX or XQuery⁻ expression Q, let free(Q) be
the set of all free variables in Q, defined analogously to
the free variables of a formula in first-order logic. That
is, free({$x/π}) = {$x}, and free({if χ then α})
consists of free(α) and the variables that appear in χ.
Further, free({for $x in $y/π return α}) contains
the variable $y and the variables in free(α) \ {$x}.
Finally, free({process-stream $y: ζ}) consists of
the variable $y, for each event handler in ζ of the form
"on-first past(S) return α" also of the variables
in free(α), and likewise for each event handler in ζ of
the form "on a as $x return Q" of the variables in
free(Q) \ {$x}.

Note that expressions of the form "{for $x in
$y/a return α}" and event handlers of the form "on
a as $x return Q" bind the variable $x, i.e., remove
it from the free variables of the superexpressions.

A FluX query is a FluX expression in which all free
variables except for the special variable $ROOT
corresponding to (the root of) the document are bound.
That is, for a query Q in FluX (resp., α in XQuery⁻)
we require that free(Q) ⊆ {$ROOT} (resp., free(α) ⊆
{$ROOT}).

As the following example shows, every XQuery⁻
query can be transformed into a FluX query in a
straightforward way.

Example 3.4 Every XQuery⁻ query α is equivalent
to the FluX query

{ ps $ROOT: on-first past(*) return α }

In Section 4 below we will show how, depending on a
given DTD, this FluX query can be transformed into
an equivalent FluX query that can be evaluated more
efficiently. □

By the size of an expression Q, denoted |Q|, we refer
to the size of its string representation.

By the parent variable of (FluX or XQuery⁻) expression
α in FluX query Q, denoted parentVar(α),
we refer to the variable bound by the nearest superexpression
of α, or $ROOT if no such variable exists.

By the condition paths in α, we refer to the set of
paths $x/π in a condition χ that occurs in α.

For FluX or XQuery⁻ expressions α and β we
write α ≼ β (resp., α ≺ β) to denote that α is a
subexpression (resp., proper subexpression) of β. An
XQuery⁻ subexpression α of a FluX expression Q is
called maximal if there is no XQuery⁻ expression β
with α ≺ β ≼ Q. Note that a FluX query may contain
several such maximal expressions.

Example 3.5 The maximal XQuery⁻ subexpressions
of the first FluX query from Section 1 are "{$t}" and
"{ for $a in $book/author return {$a} }". □

3.3 Safe Queries 

We next define the notion of safety for FluX queries.
Informally, a query is called safe for a given DTD if
it is guaranteed that XQuery⁻ subexpressions do not
refer to paths that might still be encountered in an
input stream compliant with the given DTD. For the
precise definition we need the following notion.

The set of dependencies w.r.t. variable $y in a FluX
or XQuery⁻ expression α is defined as

\[
\begin{aligned}
\mathit{dependencies}(\$y, \alpha) := {} & \{\, a \mid \text{ex. a condition path } \$y/a \text{ or } \$y/a/\pi \text{ in } \alpha \,\} \;\cup \\
& \{\, b \mid \text{ex. } \$u, \pi, Q \text{ s.t. } \pi \text{ starts with symbol } b \text{ and } \{\texttt{for } \$u \texttt{ in } \$y/\pi \texttt{ return } Q\} \preceq \alpha \,\}.
\end{aligned}
\]



Definition 3.6 (safe queries) A FluX query Q is
called safe w.r.t. a given DTD if, and only if, for each
subexpression "{ps $y: ζ}" of Q, the following two
conditions are satisfied:

1. For each handler "on-first past(S) return α"
in the list ζ, the following is true:

• ∀ b ∈ dependencies($y, α) we have: b ∈ S or
ex. a ∈ S s.t. Ord_$y(b, a)

• ∀ $z ∈ free(α) s.t. {$z} ≼ α or {$z/π} ≼ α (for
some π) we have: $z = $y and ∀ b ∈ symb($y):
b ∈ S or ex. a ∈ S s.t. Ord_$y(b, a).

2. For each handler "on a as $x return Q′" in the
list ζ, and for each maximal XQuery⁻ subexpression
α of Q′, the following is true:

• ∀ b ∈ dependencies($y, α) we have: Ord_$y(b, a)

• if α = Q′ (note that according to Definition 3.3
α must then be simple), then for all $u s.t.
{$u} ≼ α we have: $u = $x.

It can be shown that this notion of safety is suffi- 
cient to ensure that main memory buffers are fully pop- 
ulated when they are accessed by a query, i.e., that a 
FluX query can be evaluated in a straightforward way 
on input streams compliant with the given DTD. 

Examples of safe FluX queries can be found in
Sections 1 and 4.3. (To be precise, all FluX queries
occurring in this paper are safe.)

4 Translating XQuery into FluX 

In this section we address the problem of rewriting
a query of our XQuery fragment into an equivalent
FluX query that employs as little buffering as possible.
This rewriting proceeds in two steps: First, we
transform the given XQuery⁻ query into an equivalent
query in XQuery⁻ normal form (Section 4.1).
Afterwards, depending on a given DTD, this normalized
query is rewritten into an equivalent safe FluX
query (Section 4.2). The FluX extensions manage the
event-based, streaming execution of the query. All
subqueries exclusively working on buffered data are
XQuery⁻ expressions.

4.1 A Normal Form for XQuery⁻

An XQuery⁻ expression is transformed into normal
form by rewriting (subexpressions of) it using the rules
in Figure 1 until no further changes are possible.

In an XQuery⁻ expression in normal form, the
following three properties hold: (1) All paths except
those inside conditionals are simple-step paths, i.e. of
the form $x/a. (2) An expression in normal form does
not contain any conditional for-loops, as the
normalization process pushes conditionals inside the
innermost for-loops. (3) For each subexpression of the form
"{if χ then α}", α is either a fixed string or of the
form "{$x}" for some variable $x.

{ for $x in $y/π where χ return β }
----------------------------------------------
{ for $x in $y/π return { if χ then β } }

{ $y/π }
----------------------------------------------
{ for $x in $y/π return {$x} }

{ for $x in $y/a/π return β }
----------------------------------------------   ($x0 new)
{ for $x0 in $y/a return
  { for $x in $x0/π return β } }

{ if χ then { for $x in $y/π return α } }
----------------------------------------------
{ for $x in $y/π return { if χ then α } }

{ if χ then α β }
----------------------------------------------
{ if χ then α } { if χ then β }

{ if χ then { if ψ then α } }
----------------------------------------------
{ if (χ and ψ) then α }

Figure 1: Normal form rewrite rules. Each rule is
always applied downwards, i.e., the expression above
the line is replaced by the expression below the line.

Theorem 4.1 The rule applications of Figure 1 can
be implemented in such a way that the rewriting terminates
for an input XQuery⁻ expression Q after O(|Q|)
rule applications with a unique result, the so-called
normalization of Q, which is equivalent to Q.
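
As an illustration of how the rules of Figure 1 interact, the following Python sketch normalizes a tiny tuple-based syntax tree. It is a simplified assumption-laden model, not the implementation behind Theorem 4.1: the tuple encoding, the `fresh` variable generator, and the treatment of conditions as opaque values are all hypothetical, and no attempt is made to achieve the stated bound on rule applications.

```python
import itertools

_counter = itertools.count()


def fresh():
    """Return a new variable name (plays the role of '$x0 new')."""
    return f"$v{next(_counter)}"


def push_if(cond, body):
    """Push { if cond then body } inwards until body is a string or variable."""
    kind = body[0]
    if kind == "for":                       # if over for  ->  for over if
        _, x, y, path, inner = body
        return ("for", x, y, path, push_if(cond, inner))
    if kind == "seq":                       # distribute over sequences
        return ("seq", [push_if(cond, b) for b in body[1]])
    if kind == "if":                        # merge nested conditionals
        return ("if", ("and", cond, body[1]), body[2])
    return ("if", cond, body)


def normalize(e):
    kind = e[0]
    if kind in ("str", "var"):
        return e
    if kind == "seq":
        return ("seq", [normalize(x) for x in e[1]])
    if kind == "path":                      # { $y/pi }  ->  explicit for-loop
        _, y, path = e
        x = fresh()
        return normalize(("for", x, y, path, ("var", x)))
    if kind == "forwhere":                  # where-clause becomes a conditional
        _, x, y, path, cond, body = e
        return normalize(("for", x, y, path, ("if", cond, body)))
    if kind == "for":
        _, x, y, path, body = e
        if len(path) > 1:                   # split multi-step paths
            x0 = fresh()
            return ("for", x0, y, path[:1],
                    normalize(("for", x, x0, path[1:], body)))
        return ("for", x, y, path, normalize(body))
    if kind == "if":
        return push_if(e[1], normalize(e[2]))
    raise ValueError(f"unknown node kind: {kind}")


# for $b in $ROOT/bib/book where chi return <book> {$b/year} </book>
QUERY = ("forwhere", "$b", "$ROOT", ["bib", "book"], "chi",
         ("seq", [("str", "<book>"), ("path", "$b", ["year"]), ("str", "</book>")]))
NORMALIZED = normalize(QUERY)
```

The result nests a loop over bib around a loop over book, and the condition ends up attached to each string and variable output, mirroring the three normal-form properties listed above.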

Example 4.2 ([19], XMP, Q1) Consider the following
XQuery Q1 for books published by Addison-Wesley
after 1991, including their year and title.

<bib>
{ for $b in $ROOT/bib/book
  where $b/publisher = "Addison-Wesley" and
        $b/year > 1991
  return <book> {$b/year} {$b/title} </book> }
</bib>

We abbreviate the where-condition in the above query
as χ. Then Q1 has the following normalization Q1′.

<bib>
{ for $bib in $ROOT/bib return
  { for $b in $bib/book return
    { if χ then <book> }
    { for $year in $b/year return
      { if χ then {$year} } }
    { for $title in $b/title return
      { if χ then {$title} } }
    { if χ then </book> } } }
</bib>

□

4.2 Rewriting normalized XQuery into FluX 

To formulate our main rewrite algorithm for transforming
normalized XQuery⁻ queries into equivalent,
safe FluX queries, we need some further notation.



 1 function rewrite(Variable parentVar, Set(Σ) H,
 2                  XQuery⁻ β) returns FluXQuery
 3 begin
 4   let $x = parentVar;
 5   if {$x} ≼ β then
 6   begin
 7     if β is simple and dependencies($x, β) = ∅ then
 8       return β
 9     else
10       return { ps $x: on-first past(*) return β }
11   end
12   else /* {$x} ⋠ β */
13   begin
14     if β = β1 β2 then
15     begin
16       β1′ := rewrite(parentVar, H, β1);
17       match ζ1 such that β1′ = { ps $x: ζ1 };
18       β2′ := rewrite(parentVar, H ∪ hsymb(ζ1), β2);
19       match ζ2 such that β2′ = { ps $x: ζ2 };
20       return { ps $x: ζ1; ζ2 }
21     end
22     else if β is simple then
23       /* β is either of the form s or { if χ then s } */
24       return { ps $x:
25                  on-first past(dependencies($x, β) ∪ H)
26                  return β }
27     else if β is of the form
28          { for $y in $z/a return α } then
29     begin
30       X := {b ∈ dependencies($x, α) ∪ H | ¬Ord_$x(b, a)};
31       if $z ≠ $x then
32         return { ps $x: on-first past(X) return β }
33       else if X ≠ ∅ then
34         return { ps $x: on-first past(X ∪ {a}) return β }
35       else
36       begin
37         α′ := rewrite($y, ∅, α);
38         return { ps $x: on a as $y return α′ }
39       end
40     end /* if β is for-expression */
41   end /* else {$x} ⋠ β */
42 end

Figure 2: Algorithm for rewriting XQuery⁻ into FluX.

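Let Σ be the set of tag names occurring in the given
DTD. Let ⊥ denote the empty list. For a list ζ of
event handlers, we inductively define the set hsymb(ζ)
of handler symbols for which an "on" handler or an
"on-first" handler exists in ζ:

\[
\begin{aligned}
\mathit{hsymb}(\bot) &:= \emptyset \\
\mathit{hsymb}(\zeta;\ \texttt{on } a \texttt{ as } \$x \texttt{ return } \alpha) &:= \mathit{hsymb}(\zeta) \cup \{a\} \\
\mathit{hsymb}(\zeta;\ \texttt{on-first past}(S) \texttt{ return } \alpha) &:= \mathit{hsymb}(\zeta) \cup S
\end{aligned}
\]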
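
The inductive definition of hsymb amounts to a simple fold over the handler list. A minimal Python sketch, assuming handlers are encoded as hypothetical `(kind, argument)` pairs with the query body omitted:

```python
def hsymb(handlers):
    """Collect the handler symbols of a list of event handlers.

    Each handler is ("on", a) for 'on a as $x return ...' or
    ("on-first", S) for 'on-first past(S) return ...'.
    """
    symbols = set()                 # hsymb of the empty list is the empty set
    for kind, arg in handlers:
        if kind == "on":
            symbols.add(arg)        # an "on" handler contributes its label a
        else:
            symbols |= set(arg)     # an "on-first" handler contributes all of S
    return symbols
```

For example, a list consisting of `on-first past()` followed by `on bib as $bib` yields the set {bib}.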

Our algorithm for recursively rewriting normalized
XQuery⁻ expressions into FluX is shown in Figure 2.
Note that this algorithm uses order constraints
and hence depends on the underlying DTD. Given
query Q, we obtain the corresponding FluX query as
"rewrite($ROOT, ∅, Q)". Some example runs of this
algorithm are given in Section 4.3 below. The goals in
the design of the algorithm were to produce a FluX
query which (1) is safe w.r.t. the given DTD, (2) is
equivalent to the input XQuery, and (3) minimizes the
amount of buffering needed for evaluating the query on
an XML document.

To meet goals (1) and (2), the particular order
of the if-statements in the algorithm (lines 5, 14,
22, 27), for example, is crucial. Also, a set H of handler symbols
must be passed on in recursive calls of the algorithm, 
because otherwise the resulting FluX query would not 
be safe. One important construct for meeting goal (3) 
is the case distinction in lines 31-39, where an "on" 
handler is created provided that this is safe, and an 
"on-first" handler is created otherwise. 

Theorem 4.3 Given a DTD D and a normalized 
XQuery⁻ query Q, "rewrite($ROOT, ∅, Q)" runs in
time 0(|Dp -f- \Q\'^) and produces a safe FluX query 
that is equivalent to Q on all XML documents compli- 
ant with the given DTD. 

Our algorithm performs only a single traversal of 
the query tree. Runtime 0(|(5p) is mainly caused by 
the need to compute dependencies. Note that the re- 
sulting FluX query is in normal form. 

4.3 Examples 

We now discuss the effect of our rewrite algorithm on
sample queries from the XQuery Use Cases [19].²

Example 4.4 ([19], XMP, Q2) Let us consider
the XQuery Q2 from the XQuery Use Cases [19],
which creates a flat list of all the title-author pairs,
with each pair enclosed in a result element. Due to
space limitations we omit Q2 here and only give its
normalization Q2′ (which is very similar to the original
XQuery Q2):

1 <results>
2 { for $bib in $ROOT/bib return
3   { for $b in $bib/book return
4     { for $t in $b/title return
5       { for $a in $b/author return
6         <result> {$t} {$a} </result> } } } }
7 </results>

When given a DTD that does not impose any order
constraints on title and author, e.g., the first
DTD from Section 1, then "rewrite($ROOT, ∅, Q2′)"
proceeds as follows: First, Q2′ is decomposed into
two subexpressions β1, consisting of line 1, and β2,
consisting of lines 2-7. Then, the rewrite algorithm is
recursively called for β1 and for β2. As β1 is simple,
the call for β1 produces the result

{ps $ROOT: on-first past() return <results> }



(Footnote to the XQuery Use Cases citation: We rewrite the queries to work without attributes.)



The call for β2 decomposes β2 into two subexpressions
β21, consisting of lines 2-6, and β22, consisting of line
7 of Q′2. The recursive call "rewrite($ROOT, ∅, β21)"
then executes lines 36-39 of the algorithm in Figure 2
because β21 is a for-loop with parent variable $ROOT
and associated set X = X_β21 = ∅. That is, the result

{ps $ROOT: on bib as $bib return α′1 }

is produced, where α′1 is the result produced by the
recursive function call "rewrite($bib, ∅, α1)" for the
subquery α1 of Q′2 in lines 3-6. This recursive call
for α1 again executes lines 36-39 of the algorithm,
producing the expression α′1 =

{ps $bib: on book as $b return α′2 }

where α′2 is the result of "rewrite($b, ∅, α2)" for
the subquery α2 of Q′2 in lines 4-6. As α2 is a
for-loop with parent variable $b and associated set
X = X_α2 = {author}, in this call line 34 of the
algorithm is executed, producing the expression α′2 =

{ps $b: on-first past(author,title) return α2 }

All in all, "rewrite($ROOT, ∅, Q′2)" returns the
following FluX query F2:

1 {ps $ROOT:
2 on-first past() return <results>;
3 on bib as $bib return
4 {ps $bib: on book as $b return
5 {ps $b: on-first past(author,title) return
6 { for $t in $b/title return
7 { for $a in $b/author return
8 <result> {$t} {$a} </result> } } } };
9 on-first past(bib) return </results> }

We will refer to the "{ps $b ... }"-expression in lines
5-8 of F2 as α′2. When evaluating the query F2 on an
XML document, the XQuery inside α′2 will be evaluated
once all author and all title nodes have been
encountered and buffered.

Let us now consider the case where we are given a
DTD with the production

<!ELEMENT book (author*,title*)>

where the order constraint Ord_book(author, title) is
met. While running "rewrite($ROOT, ∅, Q′2)" we now
encounter the situation where X = X_α2 = ∅ (rather
than {author}, as with the previous DTD). Therefore,
when processing the recursive call "rewrite($b, ∅, α2)",
lines 36-39 of the algorithm are now executed, eventually
producing the following result α″2 =

{ps $b: on title as $t return
  {ps $t: on-first past(*) return
    { for $a in $b/author return
      <result> {$t} {$a} </result> } } }

Now, "rewrite($ROOT, ∅, Q′2)" yields a query F′2 differing
from F2 in lines 5-8, which must be replaced by
the above expression α″2.

When evaluating F′2 on an XML document compliant
with the second DTD, all author nodes arrive
before title nodes and are buffered. Encountering a
title node in the input stream invokes the following
actions: The value of that particular node is buffered,
i.e., "on-first past(*)" delays the execution until
the complete title node has been seen. Then, we iterate
over the buffer containing all collected author
nodes, each time writing the buffered title and the
current author to the output. In contrast to the worst-case
scenario above, we only buffer one title at a time
in addition to the list of all authors. If there is more
than one title, this strategy is clearly preferable. □
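
The difference between the two strategies can be made concrete with a small simulation. The following Python sketch is an illustration only and not part of our implementation: it assumes that the children of a single book node are given as (tag, xml) pairs, and the function names eval_buffer_all and eval_streaming are hypothetical. The first function mimics the handler that waits for past(author,title); the second mimics the handler obtained when all authors precede all titles. Both report the largest number of nodes held in buffers at any time.

```python
def eval_buffer_all(children):
    """Wait until all author and title children are buffered, then join them."""
    authors = [xml for tag, xml in children if tag == "author"]
    titles = [xml for tag, xml in children if tag == "title"]
    output = [f"<result>{t}{a}</result>" for t in titles for a in authors]
    # Every author and every title is held in a buffer at the same time.
    return output, len(authors) + len(titles)


def eval_streaming(children):
    """Assumes every author child precedes every title child."""
    author_buffer = []   # authors are needed again for each later title
    output = []
    peak = 0             # largest number of nodes buffered at once
    for tag, xml in children:
        if tag == "author":
            author_buffer.append(xml)
            peak = max(peak, len(author_buffer))
        elif tag == "title":
            # Only the current title is held while the author buffer is replayed.
            peak = max(peak, len(author_buffer) + 1)
            for author in author_buffer:
                output.append(f"<result>{xml}{author}</result>")
    return output, peak
```

For a book with two authors followed by three titles, both functions produce the same six result elements, but the first holds five nodes in buffers while the second never holds more than three.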

We next demonstrate that conditional for-loops are 
optimized correspondingly. 

Example 4.5 ([19], XMP, Q1) Let us consider the
query Q1 and its normalization Q′1 from Example 4.2.
Given a DTD that does not impose any order constraints,
e.g., the DTD

<!ELEMENT bib (book)*>
<!ELEMENT book (title|publisher|year)*>

the function call "rewrite($ROOT, ∅, Q′1)" rewrites Q′1
into the following FluX query F1:

1 {ps $ROOT:
2 on-first past() return <bib>;
3 on bib as $bib return
4 {ps $bib: on book as $b return
5 {ps $b:
6 on-first past(publisher,year) return
7 { if χ then <book> };
8 on-first past(publisher,year) return
9 { for $year in $b/year return
10 { if χ then {$year} } };
11 on-first past(publisher,year,title) return
12 { for $title in $b/title return
13 { if χ then {$title} } };
14 on-first past(publisher,year,title) return
15 { if χ then </book> } } };
16 on-first past(bib) return </bib> }

The "on-first" handler in lines 11-13 delays query 
execution until all title nodes have been buffered and 
all publisher and year nodes have been seen. 

When given a different DTD, ensuring that both
Ord_book(year, title) and Ord_book(publisher, title)
hold, the title nodes can be processed in a
streaming fashion. The query F′1 produced by
"rewrite($ROOT, ∅, Q′1)" with this new DTD differs from
the above query F1 in the subexpression in lines 11-13,
which must be replaced by

on title as $title return
{ if χ then {$title} }

Consequently, titles will not be buffered at all during
evaluation of this query. □



Our rewrite algorithm is well capable of optimizing 
joins over two or more join predicates, as is demon- 
strated in the following example which is not part of 
the XQuery Use Cases. 

Example 4.6 We remain in the bibliography domain
and consider documents compliant with the DTD

<!ELEMENT bib (book|article)*>
<!ELEMENT book (title,(author+|editor+),publisher)>
<!ELEMENT article (title,author+,journal)>

The following XQuery Q3 retrieves those authors of
articles which are coauthored by people who have also
edited books:

<results>
{ for $bib in $ROOT/bib return
  { for $article in $bib/article return
    { for $book in $bib/book
      where $article/author = $book/editor return
      { <result> {$article/author} </result> } } } }
</results>

For the remainder of this example, we abbreviate
the join-condition comparing the authors of articles
with the editors of books by χ. Normalization yields
the following query Q′3:

1 <results>
2 { for $bib in $ROOT/bib return
3 { for $article in $bib/article return
4 { for $book in $bib/book return
5 { if χ then <result> }
6 { for $author in $article/author return
7 { if χ then {$author} } }
8 { if χ then </result> } } } }
9 </results>

When executing "rewrite($ROOT, ∅, Q′3)" with the DTD
above, a recursive call "rewrite($bib, ∅, β)" is eventually
invoked for the subexpression β of Q′3 in lines 3-8.
As β is a for-loop with parent variable $bib and associated
set X = X_β = {book} ≠ ∅, line 34 of the algorithm
is executed, returning an expression of the form
{ps $bib: on-first past(book,article) ... }.
That is, as no order constraint between article and
book holds, an on-first handler ensures that all articles
and books will be buffered.

Altogether, "rewrite($ROOT, ∅, Q′3)" produces the
following FluX query F3, where α is used as an abbreviation
for the for-loop over books in lines 4-8 of Q′3:

1 {ps $ROOT:
2 on-first past() return <results>;
3 on bib as $bib return
4 {ps $bib: on-first past(book,article) return
5 { for $article in $bib/article return α } };
6 on-first past(bib) return </results> }

When given a different DTD which imposes an order
on books and articles, e.g. by the following production

<!ELEMENT bib (book*,article*)>

we can evaluate Q3 by buffering only book nodes but
processing article nodes in a streaming fashion.

Indeed, when executing "rewrite($ROOT, ∅, Q′3)"
with this new DTD, we eventually encounter the
situation where set X = X_β = ∅, and therefore,
lines 36-39 (rather than line 34, as with the previous
DTD) are executed. Altogether, the FluX query F′3
produced now differs from the above query F3 in the
subexpression in lines 4-5, which must be replaced by

4 {ps $bib: on article as $article return
5 {ps $article: on-first past(author) return α } };

As all book nodes will have arrived before an article
node can be encountered, data from books is available
in buffers once the first article node is being read.
When processing the children of an article node, we
first buffer all author nodes before the query can be
evaluated for the current article.

During the evaluation of F′3, we therefore only buffer
the authors of a single article in addition to the data
already stored on books, whereas the evaluation of F3
requires the authors of all articles to be buffered. □

5 Implementation 

In this section, we discuss our implementation of a
query engine for evaluating FluX queries obtained
from XQuery⁻ by the rewriting algorithm of the previous
section.

We focus on the allocation of buffers and their use 
during query evaluation. Given a FluX query, we stat- 
ically infer the buffers which are actually necessary in 
order to avoid superfluous buffering. Our prefiltering
techniques generalize those of [23] to the scenario
where certain parts of the input do not need to be
buffered - even though they are used by the query - 
because they can be processed on-the-fly. 

Buffers are implemented as lists of SAX events. The 
events stored in a buffer represent well-formed XML 
in the sense that start-element events and end-element 
events are properly nested within each other. This 
renders data read from (a stream replayed from) a 
buffer indistinguishable from data read from the input 
stream. In our implementation, we employ the same 
set of operators for handling both events originating 
from streams and from buffers.'^ 

In the following, we say that a FluX query is in
normal form if all of its (maximal) XQuery⁻ subexpressions
are in normal form.

Let Q be a safe FluX query in normal form. The
FluX query engine identifies all nodes that must be
stored in buffers, i.e. all nodes compared in join conditions,
the roots of buffered subtrees that are output,
and buffered nodes over which for-loops iterate.



^Thus, physical query evaluation proceeds in a way similar 
to that followed in XQRL ^. 



More formally, let α be an XQuery⁻ subexpression
of Q. We define Π($r, α), the set of all buffered paths
in α starting with variable $r, as Π($r, ε) = Π($r, s) = ∅,
Π($r, {$r}) = {$r}, Π($r, α β) = Π($r, α) ∪ Π($r, β),

\[
\Pi(\$r, \{\text{for } \$x \text{ in } \$y/a \text{ return } \alpha\}) =
\Pi(\$r, \alpha)
\cup \{\$r/a \mid \$y = \$r \text{ and } \Pi(\$x, \alpha) = \emptyset\}
\cup \{\$r/a/w \mid \$y = \$r \text{ and } \$x/w \in \Pi(\$x, \alpha)\},
\]

and Π($r, {if χ then α}) = Π($r, α) ∪ {$r/π |
χ contains an atomic condition $r/π RelOp $y/π′ or
$y/π′ RelOp $r/π}.

For a variable $r and a safe FluX query Q in normal
form, we now define Π($r) as the union of all Π($r, α)
s.t. α is a maximal XQuery⁻ subexpression of Q.
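
The recursive definition of the buffered paths translates directly into code. The following Python sketch is one possible reading of the definition and not our implementation: the tuple encoding of expressions ("empty", "string", "output", "seq", "for", "if") and the function name buffered_paths are assumptions made for this illustration, and a condition is given simply as a list of (left path, right path) pairs, one per atomic comparison between two paths.

```python
def buffered_paths(r, expr):
    """Return the set of buffered paths in expr that start with variable r."""
    kind = expr[0]
    if kind in ("empty", "string"):            # epsilon or a fixed string s
        return set()
    if kind == "output":                       # {$z}
        return {r} if expr[1] == r else set()
    if kind == "seq":                          # alpha beta
        result = set()
        for part in expr[1]:
            result |= buffered_paths(r, part)
        return result
    if kind == "for":                          # {for $x in $y/a return alpha}
        _, x, y, a, body = expr
        result = buffered_paths(r, body)
        if y == r:
            inner = buffered_paths(x, body)
            if not inner:
                result.add(f"{r}/{a}")
            for path in inner:
                # path is "$x" or "$x/w"; re-root it below $r/a
                result.add(f"{r}/{a}{path[len(x):]}")
        return result
    if kind == "if":                           # {if chi then alpha}
        _, comparisons, body = expr
        result = buffered_paths(r, body)
        for left, right in comparisons:
            for path in (left, right):
                if path.split("/")[0] == r:
                    result.add(path)
        return result
    raise ValueError(f"unknown expression kind: {kind}")


# The XQuery subexpression of the FluX query in Example 5.1.
example_query = (
    "for", "$book", "$bib", "book",
    ("for", "$p", "$book", "publisher",
     ("if", [("$article/author", "$book/publisher/ceo")],
      ("output", "$p"))))
```

Calling buffered_paths("$bib", example_query) and buffered_paths("$article", example_query) yields the two path sets listed in Example 5.1.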

Let T($r) be the prefix tree constructed by merging
all paths from Π($r). Intuitively, the prefix tree defines
a projection of the input document, as it describes
which parts of the input tree will be buffered.

We optimize the prefix tree in order to restrict the
amount of data being buffered. Let T^m($r) be the tree
obtained from T($r) by marking each node v if v either
occurs in a join condition or the entire subtree rooted
at node v is output and must therefore be buffered.
For unmarked nodes in T^m($r), we merely store the
SAX events for the opening and the closing tag.

Clearly, if a node is marked and we buffer it together
with its subtree, we also buffer the subtrees
of any descendant nodes at the same time. Thus we
only buffer the data of the topmost marked nodes in
T^m($r). For example, if we need to buffer two subtrees
reachable by paths π and π′ respectively, where
π is a prefix of π′, we restrict ourselves to buffering
the subtree identified by π. Let T^p($r) be the pruned
prefix tree obtained from T^m($r) by successively removing
a subtree rooted at node v′ if an ancestor node
v is marked. We refer to T^p($r) as a buffer tree.

W.l.o.g., below we will assume that variables in
queries are used uniquely, i.e., each variable name is
bound at most at one place in the query. For a safe
FluX query Q in normal form, let X be the set of
variables that are free in maximal XQuery⁻ subexpressions
of Q. The variables in X are precisely those
for which we will later define buffers.

Example 5.1 The following FluX query selects all
book publishers whose CEO has published articles.

{ ps $ROOT: on bib as $bib return
  { ps $bib: on article as $article return
    { ps $article: on-first past(author) return
      { for $book in $bib/book return
        { for $p in $book/publisher return
          { if $article/author = $book/publisher/ceo
            then {$p} } } } } } }

Here, X = {$bib, $article} and we compute the sets
of buffer paths for each variable in X:

Π($bib) = {$bib/book/publisher/ceo, $bib/book/publisher}
Π($article) = {$article/author}

[Figure 3: Buffer trees of variables $bib and $article. The tree of $bib is the chain $bib, book, publisher•; the tree of $article is the chain $article, author•.]

We construct a buffer tree for each variable in X with
a nonempty set of buffer paths. Here, we obtain the
trees shown in Figure 3 (the bullet denotes a marked
node). Note that the leaf node ceo has been pruned
off the buffer tree of variable $bib. □
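
The construction of a buffer tree from a set of buffer paths can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the code of our engine: paths are plain strings with "/" as separator, the set of marked paths is supplied by the caller, and the function name buffer_tree is hypothetical. Merging paths into a prefix tree amounts to collecting all prefixes; pruning removes every node that lies below a marked node.

```python
def buffer_tree(paths, marked):
    """Return {node_path: is_marked} for the pruned prefix tree.

    paths  -- buffer paths such as "$bib/book/publisher"
    marked -- paths whose node occurs in a join condition or is output
    """
    def prefixes(path):
        steps = path.split("/")
        return ["/".join(steps[:i]) for i in range(1, len(steps) + 1)]

    # Prefix tree: every prefix of every buffer path is a node.
    nodes = set()
    for path in paths:
        nodes.update(prefixes(path))

    tree = {}
    for node in nodes:
        proper_ancestors = prefixes(node)[:-1]
        if any(ancestor in marked for ancestor in proper_ancestors):
            continue  # already covered by a marked ancestor: pruned
        tree[node] = node in marked
    return tree
```

Applied to the paths of variable $bib with both paths marked, the sketch keeps $bib, $bib/book, and the marked node $bib/book/publisher, and drops the ceo leaf.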



We evaluate a safe FluX query Q in normal form as
follows. We compute X and construct the buffer tree
T^p($r) for each variable $r ∈ X.

We further associate an evaluator function with 
each variable $r in X. When variable $r is bound 
to a node v of the incoming XML tree, then the eval- 
uator Eval_$r is responsible for handling all events 
generated while processing the children of v. 

We define a buffer buffer_$r for variable $r in the
evaluator which calls Eval_$r. This buffer is initialized
on entering the scope of $r and freed on re-entering it.
The buffer tree of $r can be considered a schema for all
events stored in buffer buffer_$r. At the same time,
the buffer tree determines how the set of evaluators is
to be extended such that all buffers are correctly filled.

For a node v in buffer tree T^p($r), reachable by
path $r/a_1/.../a_n, we implement the corresponding
buffering strategy in a set of evaluators. Starting with
Eval_$r and a_1, we successively extend the evaluators
responsible for handling the children of a node labeled
a_i such that the events for the opening and closing
tag of the respective node will be added to buffer_$r.
In cases where no such evaluator exists, we introduce
new variables and evaluators accordingly. As there
is at most one case statement in an evaluator for an
"on a" event under a given node label "a", it is clear
where corresponding commands are to be introduced.
In case a_n is marked, we insert the respective code
for adding all events corresponding to node a_n and its
subtree to buffer_$r.



Example 5.2 Consider query F′3 of Example 4.6.
The set of buffer trees for this query is as in Figure 3
with the publisher node replaced by an editor node.

Its FluX evaluation strategy is as follows.

for each event e of Eval_$ROOT, switch(e) {
  case(ofp()): { output "<results>"; }
  case(on bib): { initialize(buffer_$bib);
                  Eval_$bib(node(e)); }
  case(ofp(bib)): { output "</results>"; } }

for each event e of Eval_$bib, switch(e) {
  case(on book): { buffer_$bib.add(<book>);
                   Eval_$book(node(e));
                   buffer_$bib.add(</book>); }
  case(on article): { initialize(buffer_$article);
                      Eval_$article(node(e)); } }

for each event e of Eval_$book, switch(e) {
  case(on editor): { buffer_$bib.add(node(e)); } }

for each event e of Eval_$article, switch(e) {
  case(on author): { buffer_$article.add(node(e)); }
  case(ofp(author)): { execute_subquery(α); } }

At the beginning of the stream, the evaluator of
variable $ROOT handles the event on-first past(),
denoted ofp() above, by writing the opening tag
<results> to the output. Correspondingly, the event
ofp(bib) signals the end of the stream and the closing
tag </results> is output.

When processing the SAX event for the opening
tag of root node bib, the buffer associated with
variable $bib is initialized and the evaluator for $bib
takes over to handle all events generated while parsing
the children of bib.

Book nodes arriving on the stream are stored 
in the buffer of variable $bib, while editor nodes 
are buffered by the evaluator for variable $book to- 
gether with their complete subtrees. As with all safe 
FluX queries, we may rely on the fact that buffer 
buffer_$bib is filled by the time we encounter the 
first article node. The buffer is not freed until the 
complete subtree under the node bound to variable 
$bib has been parsed. 

Processing the children of an article node, any
authors are first buffered in buffer_$article until the
event ofp(author) guarantees that the subquery α of
F3 can be executed correctly. □



Join conditions are handled similarly, by buffering 
both constituent paths of the condition. Simple con- 
ditions comparing a path with a constant can be eval- 
uated on the fly while reading the paths, so only a 
Boolean flag is required, which has to be appropriately 
initialized upon entering the relevant variable scope. 
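
The on-the-fly treatment of simple conditions can be illustrated by a short sketch. The following Python function is a simplified illustration and not taken from our engine: it assumes that the events of one variable scope are given as ("start", tag), ("text", chars), and ("end", tag) tuples, that the compared element does not contain nested elements of the same name, and the function name condition_holds is hypothetical. Only the text of the element currently being read is kept; the result of the comparison is recorded in a Boolean flag.

```python
def condition_holds(events, tag, constant):
    """Check on the fly whether some child named tag has text equal to constant."""
    flag = False        # initialized upon entering the variable scope
    inside = False      # True while the compared child is being read
    chunks = []         # text of the child currently being read
    for kind, value in events:
        if kind == "start" and value == tag:
            inside, chunks = True, []
        elif kind == "text" and inside:
            # A parser may deliver the text of one element in several pieces.
            chunks.append(value)
        elif kind == "end" and value == tag:
            inside = False
            if "".join(chunks) == constant:
                flag = True
    return flag
```

For a condition such as the one in Query 1 of Appendix A, the flag is set as soon as the closing tag of a matching person_id child has been read, and no person_id node needs to be kept afterwards.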

6 Experiments 

In order to assess the merits of the approach presented
in this paper, we have experimentally evaluated our
prototype query engine implemented in Java using a
number of queries on data obtained using the XMark
benchmark generator.

Figure 4: Benchmark results.

Our implementation supports the XQuery⁻ fragment
as defined earlier in this paper. We took selected queries
of the XMark benchmark and, as XQuery⁻ does not
include certain features that are used in these queries,
adapted them correspondingly. In detail, attributes
were converted into subelements of their parent element
in our tests (the XMark DTD was adjusted accordingly).
Occurrences of the XPath step text()
were replaced by {$x} expressions that print the whole
element instead. We eliminated the count($x) aggregations
by again outputting $x instead. XMark
queries 1, 8, 11, and 13 were adjusted as sketched
above. We extracted the last FLWR subexpression of
original query 20 (which computes persons whose income
is not available) for our novel query 20. The
queries thus obtained can be found in full in Appendix A.

We used data generated by the XMark xmlgen data
generation tool (V. 0.96) of the sizes 5MB, 10MB,
50MB, and 100MB as input data. All tests were performed
with the SUN JDK 1.4.2_03 and the built-in
SAX parser on an AMD Athlon XP 2000+ (1.67GHz)
with 512MB RAM running Linux (Gentoo Linux using
kernel 2.6). Our query engine was implemented
precisely as described in this paper. As a reference
implementation the Galax query engine (V. 0.3.1) was
employed with projection turned on [23]. The performance
of query evaluation was studied by measuring
the execution time (in seconds) and maximum memory
consumption (in bytes) of each engine. The times taken
for query rewriting were negligible and are not reported
separately in our experiments. The memory
and CPU usage of both query engines were measured
by internal monitoring functions (excluding the
fixed memory consumption of the Java Virtual Machine).

To give a broader overview of the performance
of our approach, we additionally evaluated our queries
with a commercial XQuery system of a major company
that has to remain anonymous and will be called
AnonX below. Unfortunately, we could not determine
the exact memory consumption for this system.
Hence, we only state its execution time. As AnonX
was not able to parse Query 11, we are not able to list
its execution time for this query.

Figure 4 shows the results of our experiments. To
evaluate most queries with input greater than 10MB,
Galax needed more than 500MB of main memory after
running for a few minutes (which caused the system to
start swapping). These runs were aborted. Our
prototype engine clearly outperforms Galax with
respect to both execution time and memory consumption.
Queries 1 and 13 are evaluated on-the-fly without
any buffering because of the order constraints imposed
by the DTD. Query 20 has to buffer only a single element
at a time, which leads to very low memory consumption
in comparison to the traditional approach.
Queries 8 and 11 perform a join on two subtrees (i.e.
of people and closed_auction resp. open_auction)
and therefore inevitably have to buffer elements. Nevertheless,
due to our effective projection scheme only
a small fraction of the original data is buffered. The
rapid increase in execution time is due to the fact that
we compute joins by naive nested loops at the moment.
(We will work on this orthogonal but vital issue in the
future.)

The comparison of the execution times to AnonX 
again shows the competitiveness of our query engine. 
AnonX ran out of memory processing queries marked 
by "-" (the maximum heap size of the Java VM was 
set to 512MB in both cases) and hence did not give 
any results in this case. 

Altogether, our optimization approach seems to 
perform very well with respect to execution time, max- 
imum memory consumption, and the maximum size of 
XML documents that can be processed. 

7 Discussion and Conclusions 

Main memory is probably the most critical resource in
(streamed) query processing. Keeping main memory
consumption low is vital to scalability and has, indirectly,
a great impact on query engine performance
in terms of running time.

The main contribution of this paper is the FluX 
language together with an algorithm for automati- 
cally translating a significant fragment of XQuery into 
equivalent FluX queries. FluX - while intended as an 
internal representation format for queries rather than a 
language for end-users - provides a strong intuition for 
buffer-conscious query processing on structured data 



streams. The algorithm uses schema information to 
schedule FluX queries so as to reduce the use of buffers. 

As evidenced by our experiments, our approach
indeed dramatically increases the scalability of main
memory XQuery engines, even though we think we
are not yet close to exhausting this approach, neither
with respect to run-time buffer management and query
processing nor query optimization.

In particular, further constraints such as cardinality
constraints derived from the DTD, telling, e.g., that a
book node has at most one publisher child (let this
be denoted by publisher ∈ 1_book), could be used to
simplify XQueries before they are rewritten into FluX
using rewrite rules such as

\[
\frac{\{\,\text{for } \$x \text{ in } \$r/a \text{ return } \alpha\,\}\ \ \{\,\text{for } \$x \text{ in } \$r/a \text{ return } \beta\,\}}
     {\{\,\text{for } \$x \text{ in } \$r/a \text{ return } \alpha\ \beta\,\}}
\qquad (a \text{ occurs at most once below } \$r)
\]

which can form the basis of algebraic query optimization
for buffer minimization.

Sequences of for-loops iterating over singletons are
a natural product of the normalization process that we
have described. For example, the query

{ for $b in $ROOT/book return
  {$b/publisher/name} {$b/publisher/address} }

uses a sequence of two loops over publisher in its normal
form, which can be rewritten into one using the
rule above. By first merging for-loops, it is often possible
to obtain FluX queries that require no buffering
at all, while two subsequent loops over the same path
generally cause that path to be buffered.
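
The merging of adjacent for-loops over a singleton path can be sketched as a simple pass over a sequence of expressions. The following Python code is an illustration under our own assumptions and not an implemented optimizer: a loop is encoded as a tuple ("for", variable, parent variable, step, body) whose body is a list of expressions, the cardinality information is passed in as a set of (parent variable, step) pairs, and the name merge_singleton_loops is hypothetical. Only the top level of the sequence is processed, and the two loops are assumed to use the same loop variable.

```python
def merge_singleton_loops(sequence, singleton_steps):
    """Merge neighbouring loops over the same singleton path into one loop."""
    merged = []
    for expr in sequence:
        previous = merged[-1] if merged else None
        is_loop = isinstance(expr, tuple) and expr[0] == "for"
        prev_is_loop = isinstance(previous, tuple) and previous[0] == "for"
        if (is_loop and prev_is_loop
                and expr[1:4] == previous[1:4]            # same $x in $r/a
                and (expr[2], expr[3]) in singleton_steps):
            # Concatenate the bodies: alpha followed by beta.
            merged[-1] = previous[:4] + (previous[4] + expr[4],)
        else:
            merged.append(expr)
    return merged


# Normal form of {$b/publisher/name} {$b/publisher/address}, in the assumed encoding.
two_loops = [
    ("for", "$p", "$b", "publisher", [("for", "$n", "$p", "name", ["{$n}"])]),
    ("for", "$p", "$b", "publisher", [("for", "$a", "$p", "address", ["{$a}"])]),
]
```

With the constraint that a book has at most one publisher, merge_singleton_loops(two_loops, {("$b", "publisher")}) returns a single loop over publisher whose body contains both inner loops; without the constraint the sequence is returned unchanged.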

Another important optimization is to push if-expressions,
which we have moved down the query
tree to obtain our normal form, back "up" the expression
tree as soon as the other simplifications have
been realized.

APPENDIX

In Appendix A we list the XQueries used as benchmark
queries in Section 6. Appendix B explains how order
constraints as well as Past and first-past events can be
efficiently checked for a given DTD.

A Benchmark Queries 

In this section we present the queries used in our
benchmark experiments. As we have briefly sketched
in Section 6, we have adapted selected queries of the
XMark benchmark to suit the capabilities of our prototype
implementation. In detail, we made the following
changes:

• Attributes were converted into subelements of their
parent element. For example,

<person id="..."> ... </person>

was converted to

<person>
<person_id> ... </person_id>
...
</person>

The XMark queries and the schema were adapted
accordingly. While processing an XML stream
(generated by the XMark xmlgen data generation
tool), our XSAX parser converted attributes into
subelements on-the-fly.

• XPath steps like text() were omitted; the whole
element was printed instead.

• Aggregations such as count($x) were omitted and
(a subtree of) element $x was written to the output
instead.

The current version of our prototype implementation
already supports an XQuery fragment that is
slightly larger than the class XQuery⁻ defined in the
present paper. For example, in the outermost for-loops
of queries, the variable $ROOT may be omitted,
and where-conditions may also use statements such as
$x/π > c * $y/π′, for constants c (cf. Query 11 below),
and empty($x/π) (cf. Query 20 below). Note that
the latter condition is equivalent to the XQuery⁻ condition
not exists $x/π.

The following queries were used in our experiments 
(for all systems): 

Query 1

<query1>
{ for $b in /site/people/person
  where $b/person_id = 'person0'
  return
    <result> {$b/name} </result> }
</query1>

Query 8

<query8>
{ for $p in /site/people/person return
  <item>
    <person> {$p/name} </person>
    <items_bought>
    { for $t in /site/closed_auctions/closed_auction
      where $t/buyer/buyer_person = $p/person_id
      return
        <result> {$t} </result> }
    </items_bought>
  </item> }
</query8>

Query 11

<query11>
{ for $p in /site/people/person return
  <items>
    {$p/name}
    { for $o in /site/open_auctions/open_auction
      where $p/profile/profile_income >
            (5000 * $o/initial)
      return
        {$o/open_auction_id} }
  </items> }
</query11>

Query 13

<query13>
{ for $i in /site/regions/australia/item return
  <item>
    <name> {$i/name} </name>
    <desc> {$i/description} </desc>
  </item> }
</query13>

Query 20

<query20>
{ for $p in /site/people/person
  where empty($p/person_income)
  return $p }
</query20>

B Efficient Checking of 
Schema Constraints 

By a marking of a regular expression ρ we denote a
regular expression ρ′ such that each occurrence of an
atomic symbol in ρ is replaced by the symbol with
its position among the atomic symbols of ρ added as
a subscript. That is, the i-th occurrence of a symbol
a ∈ symb(ρ) is replaced by a_i. The reverse of a
marking (indicated by #) is obtained by dropping the
subscripts.

All regular expressions in a DTD are one-unambiguous [3].
Intuitively, a one-unambiguous regular
expression ρ allows for deterministic matching of
a word w ∈ L(ρ) using only a one-token lookahead.

For each one-unambiguous regular expression, an
equivalent deterministic finite automaton called the
Glushkov automaton [3] can be constructed efficiently,
in quadratic time. Glushkov automata have the characteristic
properties that (1) each state in a Glushkov
automaton (apart from the initial state) corresponds
to a symbol in the marked regular expression, and (2)
each transition δ(q, a) = p into a state p takes place under
input symbol a = p^#. We refer to [3] for a formal
definition of one-unambiguous regular expressions, of
Glushkov automata, and their construction.

Let G = (Q, symb(ρ), δ, q_0, F) be a Glushkov
automaton for ρ and let δ*(q, a_1 ... a_n) =
δ(... δ(δ(q, a_1), a_2), ..., a_n). Let Λ be the reachability
relation

\[
\Lambda = \{ (q_i, q_j) \mid \exists u \in symb(\rho)^{*} : \delta^{*}(q_i, u) = q_j \}.
\]

Obviously, Λ can be computed in time O(|Q|^3) by simply,
for each q ∈ Q, computing the reachable states in
the transition graph of G (for which there is a well-known
linear time algorithm [17]).

We define the relation Past_ρ ⊆ Q × Σ as

\[
Past_\rho(q_i, a) \;:\Leftrightarrow\; \neg\exists\, q_j : q_j^{\#} = a \wedge (q_i, q_j) \in \Lambda.
\]

Intuitively, Past_ρ(q_i, a) means that on reaching state
q_i, we are past all occurrences of a (i.e., we may not encounter
a anymore until we reach the end of the word,
otherwise it is not in L(ρ)). Obviously, this relation
can be obtained from Λ in time O(|Q|^2). (Note that
# is functional and imposes a partition of Q.)
Now we can define order constraints as

\[
Ord_\rho(a, b) \;:\Leftrightarrow\; \forall q : (q^{\#} = b) \Rightarrow Past_\rho(q, a).
\]

It is easy to see that the relation Ord_ρ can be computed
in time O(|Q| · |symb(ρ)|).
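
The three relations can be computed with a few lines of code once the automaton is available. The following Python sketch is an illustration only: the automaton for the content model (author*, title*) is written down by hand rather than constructed from the regular expression, state 0 is the initial state, and all function and variable names are assumptions made for this example. Reachability is taken to be reflexive, since the empty word is admitted in the definition of Λ.

```python
def reachable(states, delta):
    """Map each state q to the set of states reachable from q (including q)."""
    successors = {q: set() for q in states}
    for (q, _symbol), target in delta.items():
        successors[q].add(target)
    reach = {}
    for q in states:
        seen, todo = {q}, [q]
        while todo:
            for nxt in successors[todo.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        reach[q] = seen
    return reach


def past_relation(states, label, reach, symbols):
    """Past(q, a) holds iff no state labelled a is reachable from q."""
    return {(q, a): all(label.get(p) != a for p in reach[q])
            for q in states for a in symbols}


def order_constraint(states, label, past, a, b):
    """Ord(a, b) holds iff Past(q, a) holds in every state q labelled b."""
    return all(past[(q, a)] for q in states if label.get(q) == b)


# Hand-written automaton for (author*, title*); LABEL plays the role of '#'.
STATES = [0, 1, 2]
LABEL = {1: "author", 2: "title"}
DELTA = {(0, "author"): 1, (0, "title"): 2,
         (1, "author"): 1, (1, "title"): 2,
         (2, "title"): 2}
```

On this automaton the sketch confirms that the order constraint holds for (author, title) but not for (title, author).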

Given a set S ⊆ Σ, we can pre-compute a table

\[
PastTable_{\rho,S}(q) \;:\Leftrightarrow\; \forall a \in S : Past_\rho(q, a).
\]

We assume that the input XML stream is processed
by a SAX parser, which validates each token
read from the input by simulating a transition of the
Glushkov automaton associated with the current DTD
production. In each such transition, we may compute
first-past_{ρ,S} by a constant-time lookup in PastTable.

More precisely, we compute first-past_{ρ,S}(u_1 ... u_n)
as follows. Initially,

\[
\textit{first-past}_{\rho,S}(\epsilon) := PastTable_{\rho,S}(q_0).
\]

On making the transition δ(q, u_n) = q′ where
q = δ*(q_0, u_1 ... u_{n-1}),

\[
\textit{first-past}_{\rho,S}(u_1 \ldots u_n) := PastTable_{\rho,S}(\delta(q, u_n)) \wedge \neg PastTable_{\rho,S}(q).
\]

Thus the SAX parser generates on-first-past punctuation
events (which fire when first-past_{ρ,S} becomes true)
in addition to traditional SAX events with very little
overhead, namely one validating DFA transition and
one constant-time lookup per input token read.
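
The generation of first-past events during validation can be sketched as follows. This Python code is an illustration under stated assumptions, not our parser: the transition function for (author*, title*) and the two PastTable dictionaries are written down by hand, the function name first_past_positions is hypothetical, and the sketch implements only the two formulas above, so the handling of the closing tag of the parent element is left out.

```python
def first_past_positions(word, delta, q0, past_table):
    """Return the positions at which first-past is true.

    Position 0 stands for the empty prefix; position i > 0 stands for the
    moment the i-th child symbol has been read.
    """
    positions = []
    if past_table[q0]:                 # first-past of the empty word
        positions.append(0)
    q = q0
    for i, symbol in enumerate(word, start=1):
        nxt = delta[(q, symbol)]       # validating DFA transition
        if past_table[nxt] and not past_table[q]:   # one table lookup
            positions.append(i)
        q = nxt
    return positions


# Hand-written automaton for (author*, title*), initial state 0.
DELTA = {(0, "author"): 1, (0, "title"): 2,
         (1, "author"): 1, (1, "title"): 2,
         (2, "title"): 2}
PAST_AUTHOR = {0: False, 1: False, 2: True}   # table for S = {author}
PAST_EMPTY = {0: True, 1: True, 2: True}      # table for S = {}
```

For the child sequence author, author, title, title and S = {author}, the event fires once, on reading the first title; for the empty set S it fires before the first child is read.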



